The choice of optimal distance measure in genome-wide datasets

Authors

Galina V. Glazko

Alexander Gordon

Arcady R. Mushegian

Abstract

Motivation: Many types of genomic data are naturally represented as binary vectors. Numerous tasks in computational biology can be cast as analysis of relationships between these vectors, and the first step is, frequently, to compute their pairwise distance matrix. Many distance measures have been proposed in the literature, but there is no theory justifying the choice of distance measure.

Results: We examine the approaches to measuring distances between binary vectors and study the characteristic properties of various distance measures and their performance in several tasks of genome analysis. Most distance measures between binary vectors turn out to belong to a single parametric family, namely generalized average-based distance with different exponents. We show that descriptive statistics of distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent. On the contrary, the more familiar distance properties, such as metric and additivity, appear to have much less effect on the performance of distances.

Availability: R code GADIST and Supplementary material are available at http://research.stowers-institute.org/bioinfo/
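As a rough illustration of what a parametric family of this kind can look like, the Python sketch below assumes the distance has the form 1 - |X∩Y| / M_β(|X|, |Y|), where M_β is the power (generalized) mean with exponent β. This form is an assumption made for illustration; it is not taken from the GADIST R code, and the function names `power_mean`, `ga_distance`, `skewness`, and `excess_kurtosis` are hypothetical. Under this assumption, β = 1 gives the Dice distance, β → 0 the Ochiai distance, β = -1 a Kulczynski-type distance, β → -∞ the Simpson (overlap) distance, and β → +∞ the Braun-Blanquet distance.

```python
import math


def power_mean(x, y, beta):
    """Power (generalized) mean of two positive numbers with exponent beta."""
    if beta == 0:
        return math.sqrt(x * y)      # limit beta -> 0: geometric mean
    if beta == math.inf:
        return max(x, y)             # limit beta -> +inf
    if beta == -math.inf:
        return min(x, y)             # limit beta -> -inf
    return ((x ** beta + y ** beta) / 2.0) ** (1.0 / beta)


def ga_distance(u, v, beta):
    """Assumed form: 1 - |X and Y| / M_beta(|X|, |Y|) for 0/1 vectors u, v."""
    shared = sum(1 for p, q in zip(u, v) if p and q)
    n_u, n_v = sum(u), sum(v)
    if n_u == 0 or n_v == 0:
        raise ValueError("each vector needs at least one non-zero entry")
    return 1.0 - shared / power_mean(n_u, n_v, beta)


def skewness(values):
    """Moment-based sample skewness (population moments, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3.0


# Toy example: two presence/absence profiles sharing two features.
u = [1, 1, 1, 1, 0, 0]
v = [1, 1, 0, 0, 0, 0]
for beta in (-math.inf, -1, 0, 1, math.inf):
    print(beta, round(ga_distance(u, v, beta), 4))
```

In practice one would compute the full pairwise distance matrix for each candidate exponent and compare the skewness and kurtosis of the resulting distance distributions; how those statistics should guide the choice is described in the paper itself, not in this sketch.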

Similar resources

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of MTS analyzing process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...

A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks have led to huge complex datasets such as time series and trajectories. As a result it is essential to use appropriate methods to analyze the produced large raw datasets. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...

Degree of Optimality as a Measure of Distance of Power System Operation from Optimal Operation

This paper presents an algorithm based on inter-solutions of having scheduled electricity generation resources and the fuzzy logic as a sublimation tool of outcomes obtained from the schedule inter-solutions. The goal of the algorithm is to bridge the conflicts between minimal cost and other aspects of generation. In the past, the optimal scheduling of electricity generation resources has been ...

A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation

Recommender systems utilize information retrieval and machine learning techniques for filtering information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative filtering based recommender systems. In order to improve accuracy of traditional user based collaborative filtering techniques under new user cold-start problem a...

Matching of Polygon Objects by Optimizing Geometric Criteria

Despite the semantic criteria, geometric criteria have different performances on polygon feature matching in different vector datasets. By using these criteria for measuring the similarity of two polygons in all matchings, the same results would not have been obtained. To achieve the best matching results, the determination of optimal geometric criteria for each dataset is considered necessary....

Journal:

Bioinformatics

Volume 21 Suppl 3

Publication date: 2005
